Variable-length category-based n-grams for language modelling

Author

  • T. R. Niesler
Abstract

This report concerns the theoretical development and subsequent evaluation of n-gram language models based on word categories. In particular, part-of-speech word classifications have been employed as a means of incorporating significant amounts of a-priori grammatical information into the model. The utilisation of categories diminishes the problem of data sparseness which plagues conventional word-based n-gram approaches, and therefore yields a fundamentally more compact model. Furthermore, it allows the use of larger n, and a strategy by means of which successively longer n-grams are selectively added to the model according to a cross-validation likelihood criterion is proposed. This enables the model compactness to be maintained while allowing longer range effects to be modelled where they benefit performance. The language modelling approach was applied to the LOB corpus in order to assess its effectiveness. When compared with models of corresponding complexity constructed according to conventional n-gram methods, it is found that the proposed procedures render language models exhibiting superior performance. Furthermore, comparison with word-based n-gram models shows that comparable performance may be achieved at a large reduction in model size. An ultimate aim of the described work is to construct language models from very large text corpora, the contents of which are generally not annotated with the required part-of-speech classifications. For this reason the use of the category-based language model as a statistical tagger is introduced as a means of automatically determining this information, and is shown by means of tests on the LOB corpus to yield very good tagging accuracies.
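The decomposition underlying a category-based model can be sketched as follows. The word probability factors into an emission term P(w | c) and a category n-gram term, summed over every category a word may belong to. The tag set, toy probabilities, and function name below are illustrative assumptions, not values from the report:

```python
from collections import defaultdict

# Toy category-based bigram model: P(w | c) emission probabilities and
# P(c_i | c_{i-1}) category transitions. All numbers are illustrative,
# not estimated from the LOB corpus.
emission = {
    ("the", "ART"): 1.0,
    ("dog", "NOUN"): 0.6, ("run", "NOUN"): 0.1,
    ("run", "VERB"): 0.5, ("barks", "VERB"): 0.2,
}
cat_bigram = {
    ("<s>", "ART"): 0.5, ("ART", "NOUN"): 0.8,
    ("NOUN", "VERB"): 0.6, ("VERB", "NOUN"): 0.3,
}
categories = {"ART", "NOUN", "VERB"}

def sequence_prob(words):
    """P(w_1..w_T) summed over all category assignments (forward recursion)."""
    # alpha[c] = probability of the prefix read so far, ending in category c
    alpha = {"<s>": 1.0}
    for w in words:
        new_alpha = defaultdict(float)
        for c in categories:
            p_wc = emission.get((w, c), 0.0)
            if p_wc == 0.0:
                continue
            for prev, a in alpha.items():
                new_alpha[c] += a * cat_bigram.get((prev, c), 0.0) * p_wc
        alpha = new_alpha
    return sum(alpha.values())

# "the"->ART (0.5*1.0), "dog"->NOUN (0.8*0.6), "barks"->VERB (0.6*0.2)
print(sequence_prob(["the", "dog", "barks"]))  # probability ≈ 0.0288
```

Because "run" carries two possible categories (NOUN and VERB), the forward sum automatically marginalises over ambiguous category membership, which is the property the abstract relies on for compactness and generalisation.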


Similar articles

A variable-length category-based n-gram language model

A language model based on word-category n-grams and ambiguous category membership with n increased selectively to trade compactness for performance is presented. The use of categories leads intrinsically to a compact model with the ability to generalise to unseen word sequences, and diminishes the sparseness of the training data, thereby making larger n feasible. The language model implicitly ...


Category-based Statistical Language Models Synopsis

Language models are computational techniques and structures that describe word sequences produced by human subjects, and the work presented here considers primarily their application to automatic speech-recognition systems. Due to the very complex nature of natural languages as well as the need for robust recognition, statistically-based language models, which assign probabilities to word seque...


Comparison of part-of-speech and automatically derived category-based language models for speech recognition

To appear in Proc. ICASSP-98, © IEEE 1998. This paper compares various category-based language models when used in conjunction with a word-based trigram by means of linear interpolation. Categories corresponding to parts-of-speech as well as automatically clustered groupings are considered. The category-based model employs variable-length n-grams and permits each word to belong to mult...


Language modeling by variable length sequences: theoretical formulation and evaluation of multigrams

The multigram model assumes that language can be described as the output of a memoryless source that emits variable-length sequences of words. The estimation of the model parameters can be formulated as a Maximum Likelihood estimation problem from incomplete data. We show that estimates of the model parameters can be computed through an iterative Expectation-Maximization algorithm and we descri...
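The multigram likelihood described above can be sketched with a short dynamic programme: the probability of a sentence sums over all ways of segmenting it into variable-length sequences emitted by a memoryless source. The toy sequence probabilities and the maximum sequence length below are illustrative assumptions:

```python
# Multigram sketch: a memoryless source emits variable-length word sequences.
# Sentence likelihood sums over all segmentations, computed by a forward
# recursion. Sequence probabilities here are illustrative only.
seq_prob = {
    ("the",): 0.3, ("cat",): 0.2, ("sat",): 0.1,
    ("the", "cat"): 0.15, ("cat", "sat"): 0.05,
}
MAX_LEN = 2  # longest multigram considered

def likelihood(words):
    """Sum over all segmentations of the product of sequence probabilities."""
    T = len(words)
    alpha = [0.0] * (T + 1)   # alpha[t] = likelihood of the first t words
    alpha[0] = 1.0
    for t in range(1, T + 1):
        for l in range(1, min(MAX_LEN, t) + 1):
            seq = tuple(words[t - l:t])
            alpha[t] += alpha[t - l] * seq_prob.get(seq, 0.0)
    return alpha[T]

# Three segmentations of "the cat sat":
#   [the][cat][sat] = 0.3*0.2*0.1 = 0.006
#   [the cat][sat]  = 0.15*0.1    = 0.015
#   [the][cat sat]  = 0.3*0.05    = 0.015
print(likelihood(["the", "cat", "sat"]))  # ≈ 0.036
```

In the EM formulation the blurb mentions, this same forward pass (paired with a backward pass) supplies the expected sequence counts used to re-estimate `seq_prob` at each iteration.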


Token merging in language model-based confusible disambiguation

In the context of confusible disambiguation (spelling correction that requires context), the synchronous back-off strategy combined with traditional n-gram language models performs well. However, when alternatives consist of a different number of tokens, this classification technique cannot be applied directly, because the computation of the probabilities is skewed. Previous work already showed...




Publication date: 1995